🚀 Pipeline Results Report

Generated: /root/verbalize-sampling

Experiments: 5

Metrics: diversity, ngram, joke_quality

📋 Experiment Configurations

| Name | Task | Method | Model | Responses | Temperature |
|---|---|---|---|---|---|
| direct (samples=1) | joke | direct | google/gemini-2.5-flash | 30 | 0.7 |
| sequence [strict] (samples=5) | joke | sequence | google/gemini-2.5-flash | 30 | 0.7 |
| multi_turn [strict] (samples=5) | joke | multi_turn | google/gemini-2.5-flash | 30 | 0.7 |
| structure_with_prob [strict] (samples=5) | joke | structure_with_prob | google/gemini-2.5-flash | 30 | 0.7 |
| combined [strict] (samples=5) | joke | combined | google/gemini-2.5-flash | 30 | 0.7 |

📝 Generation Examples

direct (samples=1)

Method: direct
Task: joke
Prompt:
Tell me a joke.
Response:
Why don't scientists trust atoms? Because they make up everything!
Method: direct
Task: joke
Prompt:
Tell me a joke.
Response:
Why don't scientists trust atoms? Because they make up everything!
Method: direct
Task: joke
Prompt:
Tell me a joke.
Response:
Why don't scientists trust atoms? Because they make up everything!

sequence [strict] (samples=5)

Method: sequence
Task: joke
Prompt:
Tell me a joke.
Response:
Why don't scientists trust atoms? Because they make up everything!
Method: sequence
Task: joke
Prompt:
Tell me a joke.
Response:
Why don't scientists trust atoms? Because they make up everything!
Method: sequence
Task: joke
Prompt:
Tell me a joke.
Response:
Why don't scientists trust atoms? Because they make up everything!

multi_turn [strict] (samples=5)

Method: multi_turn
Task: joke
Prompt:
Tell me a joke.
Response:
Why don't scientists trust atoms? Because they make up everything!
Method: multi_turn
Task: joke
Prompt:
Tell me a joke.
Response:
What do you call a fake noodle? An impasta!
Method: multi_turn
Task: joke
Prompt:
Tell me a joke.
Response:
What do you call a fish with no eyes? Fsh!

structure_with_prob [strict] (samples=5)

Method: structure_with_prob
Task: joke
Prompt:
Tell me a joke.
Response:
Why don't scientists trust atoms? Because they make up everything!
Method: structure_with_prob
Task: joke
Prompt:
Tell me a joke.
Response:
Why don't scientists trust atoms? Because they make up everything!
Method: structure_with_prob
Task: joke
Prompt:
Tell me a joke.
Response:
Why don't scientists trust atoms? Because they make up everything!

combined [strict] (samples=5)

Method: combined
Task: joke
Prompt:
Give me a witty joke that's also clever
Response:
Why don't scientists trust atoms? Because they make up everything! And not just in a 'they're the fundamental building blocks of matter' way, but in a 'they're prone to exaggeration and fabrication' way. You know, like that one atom who swore it saw a proton wearing a tiny sombrero, when clearly it was just a neutron with a particularly festive quantum fluctuation. Honestly, you give them a few electrons and they think they're the center of the universe. It's truly elementary, my dear Watson, how unreliable these subatomic fibbers can be. They're always splitting hairs, or in their case, splitting nuclei. And don't even get me started on their unstable isotopes – always looking for a reaction. It's a wonder anything stays together.
Method: combined
Task: joke
Prompt:
Give me a witty joke that's also clever
Response:
Why did the scarecrow win an award? Because he was outstanding in his field! This joke works on multiple levels. Firstly, there's the obvious pun on 'outstanding in his field,' referring to both his physical location and his exceptional performance. But beyond that, it subtly plays on our expectations of a scarecrow – an inanimate object designed to deter birds – being recognized for achievement. It's an unexpected twist that elevates a simple pun into something more delightful. The cleverness lies in taking a mundane object and imbuing it with a sense of accomplishment, highlighting the absurdity and charm of anthropomorphism. It's a testament to how language can create humor by juxtaposing the literal with the figurative, making us smile at the sheer ingenuity of the wordplay. It's truly a 'crowning' achievement in joke-telling.
Method: combined
Task: joke
Prompt:
Give me a witty joke that's also clever
Response:
Why don't scientists trust atoms? Because they make up everything! And not just in a 'they're the fundamental building blocks of matter' way, but in a 'they're prone to exaggeration and fabrication' way. You know, like that one atom who swore it saw a proton wearing a tiny sombrero, when clearly it was just a neutron with a particularly festive quantum fluctuation. Honestly, you give them a few electrons and they think they're the center of the universe. It's truly elementary, my dear Watson, how unreliable these subatomic fibbers can be. They're always splitting hairs, or in their case, splitting nuclei. And don't even get me started on their unstable isotopes – always looking for a reaction. It's a wonder anything stays together.

📊 Results Summary

Diversity Results

Note: Detailed arrays excluded from table: pairwise_diversities
| Experiment | Average Diversity | Min Diversity | Max Diversity | Std Diversity | Average Response Length | Average Unique Words | Average Vocabulary Richness | Total Cost |
|---|---|---|---|---|---|---|---|---|
| direct (samples=1) | 0.0780 | 0.0000 | 0.3523 | 0.1200 | 12.2000 | 11.7267 | 0.9707 | 0.0000 |
| sequence [strict] (samples=5) | 0.3114 | 0.0000 | 0.4728 | 0.1346 | 11.9733 | 11.5133 | 0.9684 | 0.0000 |
| multi_turn [strict] (samples=5) | 0.2897 | 0.0000 | 0.4843 | 0.1246 | 12.9067 | 12.3067 | 0.9667 | 0.0001 |
| structure_with_prob [strict] (samples=5) | 0.3114 | 0.0000 | 0.4728 | 0.1331 | 12.9267 | 12.4733 | 0.9699 | 0.0000 |
| combined [strict] (samples=5) | 0.2419 | 0.0000 | 0.4415 | 0.0802 | 133.8778 | 101.7167 | 0.7624 | 0.0006 |
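
The report does not say how the length, unique-word, and vocabulary-richness columns are computed. The sketch below is one plausible reading, assuming a simple regex tokenizer and richness defined as unique words divided by total words (a type-token ratio). The function name `lexical_stats` and the tokenization rule are hypothetical, so the numbers it produces need not match the table.

```python
import re


def lexical_stats(response: str) -> dict:
    """Per-response lexical statistics under an assumed tokenizer."""
    # Assumption: lowercase, keep runs of letters, digits and apostrophes.
    # The pipeline's real tokenizer is not documented in this report.
    words = re.findall(r"[a-z0-9']+", response.lower())
    unique = set(words)
    # Type-token ratio; defined as 0.0 for an empty response.
    richness = len(unique) / len(words) if words else 0.0
    return {
        "length": len(words),
        "unique_words": len(unique),
        "vocabulary_richness": richness,
    }


joke = "Why don't scientists trust atoms? Because they make up everything!"
stats = lexical_stats(joke)
print(stats)
```

Under this definition, richness tends to fall as responses get longer, because common words repeat. That would be consistent with the combined row, which has by far the longest responses and the lowest richness.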

📈 Diversity Visualizations

[Figure: Aggregate Metrics]
[Figure: Pairwise Diversities Distribution]
[Figure: Response Length Distribution]
[Figure: Unique Words Distribution]
[Figure: Vocabulary Richness Distribution]

Ngram Results

Note: Detailed arrays excluded from table: pairwise_rouge_l_scores
| Experiment | Average ROUGE-L | Min ROUGE-L | Max ROUGE-L | Std ROUGE-L | Average Response Length |
|---|---|---|---|---|---|
| direct (samples=1) | 0.7731 | 0.0909 | 1.0000 | 0.3366 | 12.2000 |
| sequence [strict] (samples=5) | 0.1916 | 0.0000 | 1.0000 | 0.3083 | 11.9733 |
| multi_turn [strict] (samples=5) | 0.2237 | 0.0000 | 1.0000 | 0.2853 | 12.9067 |
| structure_with_prob [strict] (samples=5) | 0.1892 | 0.0000 | 1.0000 | 0.3038 | 12.9267 |
| combined [strict] (samples=5) | 0.1391 | 0.0516 | 1.0000 | 0.0603 | 133.8778 |
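
For readers unfamiliar with the metric, here is a minimal sketch of pairwise ROUGE-L, assuming whitespace tokenization and the F1 form of the score based on the longest common subsequence (LCS). The helper names are hypothetical, and the pipeline may use a library implementation with different tokenization or weighting.

```python
from itertools import combinations


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of `a`
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 under an assumed lowercase whitespace tokenizer."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def pairwise_rouge_l(responses: list) -> list:
    """Score every unordered pair of responses."""
    return [rouge_l_f1(a, b) for a, b in combinations(responses, 2)]


responses = [
    "Why don't scientists trust atoms? Because they make up everything!",
    "Why don't scientists trust atoms? Because they make up everything!",
    "What do you call a fake noodle? An impasta!",
]
scores = pairwise_rouge_l(responses)
print(scores)
```

Two identical responses score 1.0 under this definition, which is one way a maximum of 1.0000 can appear in every row of the table: a single repeated joke within an experiment is enough.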

📈 Ngram Visualizations

[Figure: Aggregate Metrics]
[Figure: Pairwise ROUGE-L Scores Distribution]
[Figure: Response Length Distribution]

Joke Quality Results

| Experiment | Funniness | Cleverness | Originality | Structure | Overall | Normalized Overall |
|---|---|---|---|---|---|---|
| direct (samples=1) | 3.4467 | 3.7067 | 2.2933 | 3.9467 | 3.3560 | 0.6712 |
| sequence [strict] (samples=5) | 3.4067 | 3.7067 | 2.3800 | 4.1600 | 3.3893 | 0.6779 |
| multi_turn [strict] (samples=5) | 3.2733 | 3.5200 | 2.3200 | 4.0067 | 3.2543 | 0.6509 |
| structure_with_prob [strict] (samples=5) | 3.4600 | 3.7667 | 2.4667 | 4.2000 | 3.4490 | 0.6898 |
| combined [strict] (samples=5) | 3.3000 | 3.9889 | 3.0889 | 3.5722 | 3.4708 | 0.6942 |
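
The report does not define Normalized Overall, but the values in the table are consistent with Overall divided by 5, which suggests a 5-point rating scale. The short check below tests that reading against the reported rows. The scale maximum is an inference from the numbers, not something the report states.

```python
# (Overall, Normalized Overall) pairs copied from the table above.
rows = {
    "direct (samples=1)": (3.3560, 0.6712),
    "sequence [strict] (samples=5)": (3.3893, 0.6779),
    "multi_turn [strict] (samples=5)": (3.2543, 0.6509),
    "structure_with_prob [strict] (samples=5)": (3.4490, 0.6898),
    "combined [strict] (samples=5)": (3.4708, 0.6942),
}

SCALE_MAX = 5  # assumption: scores are on a 5-point scale

# Largest gap between Overall / SCALE_MAX and the reported normalized value.
max_gap = max(
    abs(overall / SCALE_MAX - normalized)
    for overall, normalized in rows.values()
)
consistent = max_gap < 1e-4  # allow for 4-decimal rounding in the table
print(f"max gap: {max_gap:.6f}, consistent with /{SCALE_MAX}: {consistent}")
```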

📈 Joke Quality Visualizations

[Figure: Aggregate Metrics]
[Figure: Cleverness Score Distribution]
[Figure: Funniness Score Distribution]
[Figure: Originality Score Distribution]
[Figure: Structure Score Distribution]

📈 Overall Comparison

📈 Comparison Chart Visualizations

[Figure: Comparison Chart]